Distributed Non-Parametric Representations for Vital Filtering: UW at TREC KBA 2014

Authors

  • Ignacio Cano
  • Sameer Singh
  • Carlos Guestrin
Abstract

Identifying documents that contain timely and vital information for an entity of interest, a task known as vital filtering, has become increasingly important with the availability of large document collections. To efficiently filter such large text corpora in a streaming manner, we need to compactly represent previously observed entity contexts, and quickly estimate whether a new document contains novel information. Existing approaches to modeling contexts, such as bag of words, latent semantic indexing, and topic models, are limited in several respects: they are unable to handle streaming data, do not model the underlying topic of each document, suffer from lexical sparsity, and/or do not accurately estimate temporal vitalness. In this paper, we introduce a word embedding-based non-parametric representation of entities that addresses the above limitations. The word embeddings provide accurate and compact summaries of observed entity contexts, further described by topic clusters that are estimated in a non-parametric manner. Additionally, we associate a staleness measure with each entity and topic cluster, dynamically estimating their temporal relevance. This approach of using word embeddings, non-parametric clustering, and staleness provides an efficient yet appropriate representation of entity contexts for the streaming setting, enabling accurate vital filtering.
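The abstract gives no implementation details, so the following is only a minimal sketch of how the three ingredients it names (embedding summaries, non-parametric topic clusters, and a staleness measure) could fit together in a streaming setting. It is not the authors' method. The threshold-based cluster creation, the running-mean centroid, the exponential half-life staleness, and every name and parameter value (`EntityContext`, `TopicCluster`, `new_cluster_threshold`, `half_life`) are assumptions made for illustration. Document vectors are assumed to be precomputed, for example by averaging word embeddings.

```python
import math
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


@dataclass
class TopicCluster:
    centroid: list    # running mean of the document vectors assigned here
    count: int        # number of documents assigned so far
    last_seen: float  # timestamp of the most recent assignment

    def staleness(self, now, half_life):
        """Assumed measure in [0, 1): 0 right after an update, rising with age."""
        return 1.0 - 0.5 ** ((now - self.last_seen) / half_life)


@dataclass
class EntityContext:
    new_cluster_threshold: float = 0.7  # assumed similarity cut-off
    half_life: float = 3600.0           # assumed half-life, in seconds
    clusters: list = field(default_factory=list)

    def observe(self, doc_vec, now):
        """Assign a document vector to a topic cluster.

        Returns (cluster_index, is_new_cluster, staleness_before_update).
        The number of clusters is not fixed in advance: a new one is opened
        whenever no existing centroid is similar enough.
        """
        best, best_sim = None, -1.0
        for i, cluster in enumerate(self.clusters):
            sim = cosine(doc_vec, cluster.centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        if best is None or best_sim < self.new_cluster_threshold:
            # Unseen topic: open a cluster and report maximal staleness (assumption).
            self.clusters.append(TopicCluster(list(doc_vec), 1, now))
            return len(self.clusters) - 1, True, 1.0

        cluster = self.clusters[best]
        stale = cluster.staleness(now, self.half_life)
        cluster.count += 1
        # Incremental mean keeps the summary compact: one vector per cluster.
        cluster.centroid = [
            m + (x - m) / cluster.count for m, x in zip(cluster.centroid, doc_vec)
        ]
        cluster.last_seen = now
        return best, False, stale


ctx = EntityContext(new_cluster_threshold=0.7, half_life=10.0)
first = ctx.observe([1.0, 0.0], now=0.0)    # opens the first cluster
second = ctx.observe([0.9, 0.1], now=10.0)  # joins it after one half-life
third = ctx.observe([0.0, 1.0], now=11.0)   # dissimilar, opens a second cluster
```

In a sketch like this, a downstream classifier could treat `is_new_cluster` and the staleness value as features when deciding whether a document is vital. How the paper actually combines these signals is not stated in the abstract.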


Similar resources

WHU at TREC KBA Vital Filtering Track 2014

This paper describes the WHU IRLAB participation to the Vital Filtering task of the TREC 2014 Knowledge Base Acceleration Track. In this task, we implemented a system to detect vital documents that could be used for a human editor to update or create the profile of an entity. Our approach is to view the problem as a classification problem and use Stanford NLP Toolkit to extract necessary inform...


MSR KMG at TREC 2014 KBA Track Vital Filtering Task

In this paper, we present our strategy for TREC 2014 KBA track Vital Filtering task. This task is also known as "Cumulative Citation Recommendation" or "CCR" in 2012 and 2013. Vital Filtering task is to identify "vital" documents containing timely and new information that should be used to update the profile of a given entity (also called a topic). Our strategy for vital filtering is to first r...


BIT and Purdue at TREC-KBA-CCR Track 2014

This report summarizes our participation at KBA-CCR track in TREC 2014. Our submissions are generated in two steps: (1) Filtering a candidate documents collection from the stream corpus for a set of target entities; and (2) Estimating the relevance levels between candidate documents and target entities. Three kinds of approaches are employed in the second step, including query expansion, classi...


IRIT at TREC KBA 2014

This paper describes the IRIT lab participation to the Vital Filtering task (also known as Cumulative Citation Recommendation) of the TREC 2014 Knowledge Base Acceleration Track. This task aims at identifying vital documents containing timely new information that should help a human to update the profile of the target entity (e.g., Wikipedia page of the entity). In this work, we evaluate two fa...


Evaluating Stream Filtering for Entity Profile Updates in TREC 2012, 2013, and 2014

The Knowledge Base Acceleration (KBA) track ran in TREC 2012, 2013, and 2014 as an entity-centric filtering evaluation. This track evaluates systems that filter a time-ordered corpus for documents and slot fills that would change an entity profile in a predefined list of entities. Compared with the 2012 and 2013 evaluations, the 2014 evaluation introduced several refinements, including high-qual...




Publication date: 2014